Entry Name: "UBA-CIMINIERI-MC2"
VAST Challenge 2015
Mini-Challenge 2
Team Members:
María Ciminieri, Universidad de Buenos Aires, mciminieri@gmail.com PRIMARY
Marcos Patricio Armas Villamarín, Universidad de Buenos Aires, marcos.armas@gmail.com
José Rozanec, Universidad de Buenos Aires, jmrozanec@gmail.com
Student Team:
YES
Did you use data from both mini-challenges? NO
Analytic Tools Used:
PostgreSQL
SQL Server
Tableau
Gephi
Google spreadsheet
Approximately how many hours were spent working on this submission in total?
300 hours
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video:
UBA-CIMINIERI-MC2.wmv
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC2.1 – Identify those IDs that stand out for their large volumes of communication. For each of these IDs
a. Characterize the communication patterns you see.
b. Based on these patterns, what do you hypothesize about these IDs?
Limit your response to no more than 4 images and 300 words.
Three IDs caught our attention because of their large number of communications: 839736, 1278894 and external.
ID=839736 has a total of 60818 incoming and 60812 outgoing communications;
ID=1278894 has a total of 189894 incoming and 190360 outgoing communications; and ID=external has 62077
incoming communications. It is important to highlight that the average number of communications for the
remaining IDs is much lower: 440. Figure 1 shows the percentage that these numbers represent.
Figure 1: Percentage of communications gathered by outstanding IDs = 839736, 1278894 and external vs. the communications between the other IDs.
ID=1278894 sends bursts of messages at specific times: every five minutes at 12, 14, 16, 18 and 20 hours.
The average number of communications increases each day: 644, 1030 and 1256. We found an interesting behaviour
in the number of outgoing and incoming messages for IDs 1278894 and 839736: the two totals are
practically the same (189894 received/190360 transmitted = 99.7% for ID=1278894, and a ratio of 60812/60818
= 99.9% between the two totals for ID=839736). Therefore, we think there is an acknowledgement for each message sent or received by them.
See Figure 2 for details on ID=1278894.
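The balance between the two totals can be checked with a few lines of Python. This is only a sketch: the function name `balance_ratio` is ours, and the totals are the ones quoted in the text.

```python
def balance_ratio(incoming: int, outgoing: int) -> float:
    """Return the smaller of the two totals as a fraction of the larger."""
    low, high = sorted((incoming, outgoing))
    return low / high if high else 1.0


# (incoming, outgoing) totals as quoted in the text.
TOTALS = {
    "1278894": (189894, 190360),
    "839736": (60818, 60812),
}

for node_id, (incoming, outgoing) in TOTALS.items():
    print(f"ID={node_id}: {balance_ratio(incoming, outgoing):.2%}")
```

A ratio close to 100% is what one would expect if nearly every message is matched by a reply, which is the basis of the acknowledgement hypothesis.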
We also noticed that these IDs have different communication patterns: while ID=839736 receives messages which
it then acknowledges, ID=1278894 initiates the communications. Based on these patterns we can state that these
two IDs are the "Park's numbers": the ones used by the park to inform, locate and allow interaction
(including messages and playing the trivia game) among both visitors and staff members. The location of
these IDs is fixed.
Figure 2: Number of messages sent and received for ID=1278894.
We found that ID=1278894 is used for special communications (a small, fixed number of IDs communicate with it),
while ID=839736 has no restrictions: almost all of the IDs communicate with it. There are 9430 IDs in the data set;
ID=1278894 communicates with 2521 IDs (26.73%) and ID=839736 communicates with 8720 IDs (92.47%); see Figure 3.
Figure 3: IDs that communicate with IDs 1278894 and 839736.
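As an illustration of how such a share can be obtained, the sketch below counts the distinct IDs that exchange messages with a given hub. The record layout (timestamp, sender, receiver, area), the function name `partner_share` and the sample rows are assumptions made for the example, not the queries we actually ran.

```python
def partner_share(records, hub, total_ids):
    """Count distinct IDs exchanging messages with `hub` and their share of all IDs."""
    partners = set()
    for _ts, sender, receiver, _area in records:
        if sender == hub:
            partners.add(receiver)
        elif receiver == hub:
            partners.add(sender)
    partners.discard(hub)  # ignore any message the hub sends to itself
    return len(partners), len(partners) / total_ids


# Invented sample rows: (timestamp, sender, receiver, area).
sample = [
    ("t1", "hub", "a", "Wet Land"),
    ("t2", "b", "hub", "Wet Land"),
    ("t3", "a", "b", "Coaster Alley"),
]
count, share = partner_share(sample, "hub", total_ids=4)
print(count, f"{share:.2%}")
```

With the counts reported above, the same division gives 2521/9430 and 8720/9430, which round to the percentages quoted in the text.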
MC2.2 – Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
Limit your response to no more than 10 images and 1000 words.
Pattern 1
In the first pattern related to the crime, we observe all communications regardless of destination, focusing on time and emitter location. We observed a reduction in the number of communications on Sunday afternoon in the Wet Land area, where the Creighton Pavilion is located and where the Olympic medals were exhibited.
Between 11:00h and 16:00h on Friday and Saturday, the number of communications from this area was large. On Sunday, the second event seems to have been cancelled, which explains the lower communication levels shown in Figure 4.
Figure 4: Number of communications per area per date and hour and number of origin IDs starting those communications
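The aggregation behind a chart like Figure 4 can be sketched as follows. The record layout (timestamp, sender, receiver, area), the function name `hourly_area_summary` and the sample rows are assumptions for illustration; the figure itself was produced with the tools listed at the top of this entry.

```python
from collections import defaultdict
from datetime import datetime


def hourly_area_summary(records):
    """Map (area, date, hour) -> (number of messages, number of distinct senders)."""
    counts = defaultdict(int)
    senders = defaultdict(set)
    for ts, sender, _receiver, area in records:
        key = (area, ts.date(), ts.hour)
        counts[key] += 1
        senders[key].add(sender)
    return {key: (counts[key], len(senders[key])) for key in counts}


# Invented sample rows, not taken from the challenge data.
rows = [
    (datetime(2014, 6, 8, 11, 5), "1", "2", "Wet Land"),
    (datetime(2014, 6, 8, 11, 40), "1", "3", "Wet Land"),
    (datetime(2014, 6, 8, 11, 50), "4", "1", "Wet Land"),
    (datetime(2014, 6, 8, 12, 1), "4", "1", "Coaster Alley"),
]
summary = hourly_area_summary(rows)
```

Keeping the message count and the distinct-sender count side by side makes it possible to tell a few very active IDs apart from many IDs each sending a little.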
Pattern 2
A second pattern we observed is related to ID=839736: at 12:00h on Sunday, the number of communications sent from Wet Land to ID=839736 increases, and the number of IDs that send those messages increases too. ID=839736 acknowledges these messages, keeping the same response pattern (see Figure 5).
We can also see a burst of messages to ID=839736, probably asking whether the 14:30 show is still scheduled. This show should have taken place at Coaster Alley.
Figure 5: Communication pattern related to ID=839736.
Pattern 3
To find potential communities relevant to the crime, we excluded ID=1278894 and considered only communications held on Sunday between 11:00h and 14:00h. Using Gephi, we applied the Yifan Hu graph layout to them and found a group of 37 IDs that move together through the park while communicating heavily with one another. The IDs are: 1742503, 1358860, 284127, 1309055, 1159870, 124441, 1635915, 2085681, 972182, 68872, 2056066, 229760, 1938686, 1570276, 1784014, 661984, 1350376, 508963, 245363, 1872848, 988181, 1041478, 611447, 1644402, 771453, 887530, 955733, 543006, 763108, 651308, 1708002, 1364488, 378256, 1390642, 1038892, 2047906, 408906.
This group has an emission pattern that stands out (see Figure 6): they started sending each other a large number of messages from Wet Land at 11:30h, the time when the Pavilion opened. We chose a stacked graph to highlight the aggregate quantity of messages as well as the synchronization within the group. Looking closely at the timestamps for each ID, we observed that these messages are not sent in the same second. Based on this, we discarded the hypothesis of a machine-generated pattern. Since most communications had 36 destinations, we concluded that the IDs are talking to one another within the same chat group.
Figure 6: Communication within the group at 11:00h on Sunday morning.
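The "36 destinations" observation can be tested mechanically. The sketch below is a minimal illustration that assumes each record is a (timestamp, sender, receiver, area) tuple and that a group message appears as one record per recipient with the same sender and timestamp. The function names and the three-member sample group are hypothetical.

```python
from collections import defaultdict


def fanout(records):
    """Map (sender, timestamp) -> set of recipients of that send."""
    out = defaultdict(set)
    for ts, sender, receiver, _area in records:
        out[(sender, ts)].add(receiver)
    return out


def group_broadcasts(records, group):
    """Return the (sender, timestamp) sends that reach exactly the rest of `group`."""
    group = set(group)
    return [
        key
        for key, recipients in fanout(records).items()
        if key[0] in group and recipients == group - {key[0]}
    ]


# Invented sample: "a" writes to the whole group, "b" writes to one member only.
sample = [
    ("11:30:01", "a", "b", "Wet Land"),
    ("11:30:01", "a", "c", "Wet Land"),
    ("11:30:07", "b", "a", "Wet Land"),
]
hits = group_broadcasts(sample, {"a", "b", "c"})
```

For a group of 37 IDs, a send that reaches exactly the other 36 members is what a shared chat group would produce, while one-to-one messages would not pass the filter.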
Pattern 4
Following the same strategy we used to find pattern 3, we excluded ID=1278894, considered only communications held on Sunday between 11:00h and 14:00h, and applied the Yifan Hu graph layout to them. We located another group, which stayed in Wet Land all Sunday morning (see Figure 7). None of the other groups we observed stayed in the same area for as long as one hour. We suspect that this group may correspond to staff members of the park.
The IDs are: 742660, 1832566, 1449007, 1686979, 1129933, 1627468, 690747, 1089302, 1944047, 596085, 223540, 915404, 307026, 1116893, 1475013, 432971, 284117, 1768434, 1646453, 827532, 208111, 1065343, 273501, 2033274, 529543, 745640, 1325621, 182931, 264546, 384461, 464204, 801052, 1200791, 406905, 1467311, 234704.
Figure 7: Communications and their locations for the group, which stayed in the Wet Land area for most of Sunday morning.
Pattern 5
We found a burst of communications to ID=external at 11:45h on Sunday (see Figure 8). These communications are sent from the Wet Land area only. It is important to note that the number of communications and the number of IDs sending them follow the same pattern, and that each message comes from a different ID.
The 37 IDs that form the group presented in pattern 3 participate in sending these communications, but they represent only a small fraction of the 240 IDs that send messages to external at 11:58.
Figure 8: Communications and the IDs that send messages to the external number, from the Wet Land area.
Pattern 6
When an important event happens, people are most likely to reach out to their own friends first. With this in mind, we asked ourselves: how long would it take to spread the word that a crime had just happened at Dino Fun World? Since we had no way of inspecting the content of communications, we marked everyone who received a message from Wet Land as “acknowledged” and added them to the list of people considered to be spreading the word. We also excluded communications with IDs 1278894 and 839736 to avoid noise. Since estimating information propagation with this algorithm may be valid only over a short period of time, we performed the analysis iterating minute by minute between 11:30h and 12:10h. To calculate the ratio of IDs reached, we divided the calculated spread by the estimated total number of IDs in the park, which we took to be the count of IDs that communicated on Sunday until 12:10h.
As a result, we observed that messages from Wet Land propagated between 30% and 348% faster than messages from other park areas, depending on the area and the moment. We estimated that 20 minutes after the crime, 60% of the people in the park were aware of it (see Figure 9).
Figure 9: Wet Land information propagation pattern.
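One possible reading of the propagation estimate described above is sketched below. It is not the exact procedure we ran: the record layout (timestamp, sender, receiver, area), the function name `spread_curve`, the rule that a message from an already reached ID also spreads the word, and the choice to count both senders and receivers as "IDs in the park" are all assumptions made for the example.

```python
from datetime import datetime, timedelta

EXCLUDED = {"1278894", "839736"}  # hub IDs left out, as in the text


def spread_curve(records, area, start, end, step=timedelta(minutes=1)):
    """Return [(time, ratio of IDs reached)] for each step between start and end."""
    usable = sorted(
        (r for r in records if r[1] not in EXCLUDED and r[2] not in EXCLUDED),
        key=lambda r: r[0],
    )
    # Assumption: "IDs in the park" = every ID seen sending or receiving up to `end`.
    population = {i for ts, s, d, _ in usable if ts <= end for i in (s, d)}
    aware, curve = set(), []
    pos = 0
    while pos < len(usable) and usable[pos][0] < start:
        pos += 1  # skip traffic before the window opens
    t = start
    while t < end:
        t += step
        while pos < len(usable) and usable[pos][0] < t:
            _ts, sender, receiver, loc = usable[pos]
            # A message spreads the word if it leaves `area` or comes from a reached ID.
            if loc == area or sender in aware:
                aware.update((sender, receiver))
            pos += 1
        curve.append((t, len(aware) / len(population) if population else 0.0))
    return curve


# Invented sample rows, not taken from the challenge data.
base = datetime(2014, 6, 8, 11, 30)
sample = [
    (base + timedelta(seconds=10), "a", "b", "Wet Land"),
    (base + timedelta(seconds=65), "b", "c", "Coaster Alley"),
    (base + timedelta(seconds=90), "d", "e", "Coaster Alley"),
]
curve = spread_curve(sample, "Wet Land", base, base + timedelta(minutes=3))
```

Running the same function with a different `area` gives the curves for the other park areas, which is how a relative speed such as the one reported above could be compared.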
Pattern 7
We discovered that message emissions follow a topology pattern, which repeats irregularly over time on Sunday between 11:30h and 12:00h at Wet Land. To obtain these patterns, we divided the communications into one-minute periods. We observed moments of vigorous activity, which involved almost all the IDs (strong emissions; see Figure 10 (a) and (c)), followed by periods of weak activity, which involved just some of the IDs in the park (weak emissions; see Figure 10 (b) and (d)). Over the time observed, strong emission periods were followed by weak emission periods, with strong emission periods lasting up to three minutes in a row. Strong emissions involved mainly IDs 1278894, external and 839736. The weak emissions consist of minor local communications, and we could not identify any regular group communicating in each interval.
Figure 10: Emission topology patterns.
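The one-minute slicing can be sketched as below. The split into "strong" and "weak" was made visually from the graph layouts, so the numeric `threshold`, the function names and the sample rows are purely hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def minute_activity(records):
    """Map each minute -> set of IDs that sent or received a message in it."""
    active = defaultdict(set)
    for ts, sender, receiver, _area in records:
        active[ts.replace(second=0, microsecond=0)].update((sender, receiver))
    return active


def classify(records, total_ids, threshold=0.6):
    """Label a minute 'strong' when at least `threshold` of all IDs take part."""
    return {
        minute: "strong" if len(ids) / total_ids >= threshold else "weak"
        for minute, ids in sorted(minute_activity(records).items())
    }


# Invented sample rows, not taken from the challenge data.
base = datetime(2014, 6, 8, 11, 30)
sample = [
    (base + timedelta(seconds=5), "1278894", "a", "Wet Land"),
    (base + timedelta(seconds=20), "b", "external", "Wet Land"),
    (base + timedelta(seconds=70), "a", "b", "Wet Land"),
]
labels = classify(sample, total_ids=5)
```

In the sample, the first minute involves four of the five IDs and the second only two, so they fall on opposite sides of the hypothetical threshold.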
MC2.3 – From this data, can you hypothesize when the crime was discovered? Describe your rationale.
Limit your response to no more than 3 images and 300 words.
We found a group of 37 IDs (group no. 6) that, at 11:30 hours, the time when the Creighton Pavilion opens after recess, start exchanging a large number of communications among themselves (see Figure 11). These communications, as shown in Figure 6, seem to be rather controlled and evenly distributed among all the IDs involved, which might indicate that there was a problem and that it was handled in a methodical way. Given this pattern, we hypothesize that the crime was discovered at 11:30 by this group of IDs.
Figure 11: Number of messages sent, comparing those sent from IDs of group no. 6 and from other IDs at the same time.
The communication pattern of group no. 6 continues until 11:45 hours, the moment at which communications to ID=external peak. This peak has 283 communications from 188 different IDs (see Figure 8). These 188 IDs that communicate with ID=external include the 37 IDs from group no. 6, which we hypothesize are the ones that discovered the crime.
This means that each of the 188 IDs sent between one and two messages on average (283/188 ≈ 1.5). We think that at 11:45 a number of IDs find out about the missing medals and contact the external ID to give notice. In the following minutes, communications to external continue to rise less sharply, while the number of IDs starting those communications keeps increasing.